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Abstract 

The risk of reinsurance portfolios covering globally occurring natural catastrophes, such as earthquakes and hurri¬ 
canes, is quantified by employing simulations. These simulations are computationally intensive and require large 
amounts of data to be processed. The use of many-core hardware accelerators, such as the Intel Xeon Phi and the 
NVIDIA Graphics Processing Unit (GPU), are desirable for achieving high-performance risk analytics. In this paper, 
we set out to investigate how accelerators can be employed in risk analytics, focusing on developing parallel algo¬ 
rithms for Aggregate Risk Analysis, a simulation which computes the Probable Maximum Loss of a portfolio taking 
both primary and secondary uncertainties into account. The key result is that both hardware accelerators are useful 
in different contexts; without taking data transfer times into account the Phi had lowest execution times when used 
independently and the GPU along with a host in a hybrid platform yielded best performance. 

Keywords: hardware accelerators, many-core computing, secondary uncertainty, financial risk, catastrophic risk, 
risk analysis. Phi, GPU 


1. Introduction 

Risk analytics m has become an integral part of a 
business process in a range of domains (for example, 
E Ell a in m El). Large datasets are consumed by a 
risk model and hundreds of thousands or even millions 
of time consuming simulations are performed. Here the 
application of parallel and high-performance computing 
techniques are attractive. 

Interestingly, in the financial risk domain, specifically 
insurance and reinsurance, where data sizes are as large 
or even larger than what is employed in the above do¬ 
mains, relatively fewer parallel and high-performance 
computing techniques have been applied. Given the de¬ 
pendencies of the insurance and reinsurance setting on 
volatile markets, simulations that can be performed in a 
timely manner are essential. 
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1.1. Background 

Companies hold portfolios of contracts that cover 
risks associated with catastrophic events such as earth¬ 
quakes, hurricanes and floods. In order to have a mar¬ 
ketplace for such risk it is critical to be able to efficiently 
quantify individual risks and portfolios of risks. The 
analytical pipeline of the modern quantitative insurance 
or reinsurance company typically consists of three ma¬ 
jor stages, namely risk assessment EHOl, portfolio risk 
management and pricing {TH , and enterprise risk man¬ 
agement ca. 

In the first stage, catastrophe models Ifia are used 
to provide loss estimates by taking two inputs. Firstly, 
stochastic event catalogues which are a mathemati¬ 
cal representation of the natural occurrence patterns 
and characteristics of catastrophes. Secondly, exposure 
databases that describe thousands or millions of build¬ 
ings to be analysed, their construction types, location, 
value, use, and coverage. Each event-exposure pair is 
then analysed by a risk model that quantifies the hazard 
intensity at the exposure site, the vulnerability of the 
building and resulting damage level, and the expected 
loss, given the customer’s financial terms. The output of 
a catastrophe model is an Event Loss Table (ELT) which 
specifies the probability of occurrence and the expected 
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loss for every event in the catalogue. 

In the second stage of the analysis pipeline, portfo¬ 
lio risk management and pricing of portfolios of con¬ 
tracts necessitates a further level of stochastic simula¬ 
tion, called Aggregate Risk Analysis or referred to as 
ARA in this paper |[T4l[T5l[T6l[T7l[T8l. ARA is a Monte 
Carlo like simulation in which each trial represents an 
alternative view of which catastrophic events occur and 
in which order they occur within a predetermined period 
or a contractual year. In order to provide actuaries and 
decision makers with a consistent lens through which to 
view results, a pre-simulated Year Event Table (YET) 
containing a million alternative views of a single con¬ 
tractual year is employed. The output of ARA is a Year 
Loss Table, in which the results are highly aggregated. 

Erom the output of ARA, a reinsurer can derive im¬ 
portant portfolio risk metrics such as the Probable Max¬ 
imum Loss (PML) (191 EOl and the Tail Value-at-Risk 
(TVaR) (2T][22l. The output is then interpreted by actu¬ 
aries for key internal decision making, planning a finan¬ 
cial year and reporting to regulators and rating agencies. 
Eurthermore, these metrics then fiow into the final stage 
in the risk analysis pipeline, namely Enterprise Risk 
Management, where liability, asset, and other forms of 
risks are combined and correlated to generate an enter¬ 
prise wide view of risk. 

1.2. Problems 

There are two problems that need to be addressed 
for achieving high-performance risk analytics. Both 
problems can be solved if the data, memory and com¬ 
putational challenges of the analysis can be efficiently 
addressed. The first problem is related to developing 
methods for applying uncertainties. ARA presented 
above accounts for only ‘Primary Uncertainty’, which is 
the uncertainty associated with whether an event occurs 
or not in a simulated year. However, there is ‘Secondary 
Uncertainty’, which captures the uncertainty in the level 
of loss due to the use of simplified physical models and 
limitations in the available data. 

There are many sources of this uncertainty that need 
to be taken into account when considering catastrophic 
risk, including unknown exposure and hazard parame¬ 
ters and their interaction. Eor example, the exposure 
data which describes the buildings, their locations, and 
construction types may be incomplete, lacking in suffi¬ 
cient detail, or may be simply inaccurate. Also the phys¬ 
ical modelling of hazard, for example an earthquake, 
may naturally generate a distribution of hazard inten¬ 
sity values due to uncertainty associated with the en¬ 
ergy attenuation functions used or driving data such as 
soil type. Lastly, building vulnerability functions are 


simplifications of complex physical phenomenon and 
are therefore much better at producing loss distributions 
than accurate point estimates. Hence, there is a need 
to develop methods to not only capture primary uncer¬ 
tainty but also quantify secondary uncertainty in risk 
analysis. 

The analysis uses mean loss values when only pri¬ 
mary uncertainty is accounted for. Using such discrete 
values is an oversimplification, because in reality for 
any event there is a multitude of possible loss outcomes 
resulting in a distribution of potential losses. A simu¬ 
lation taking a distribution of losses requires statistical 
tools, for example, the beta probability distribution to 
estimate the loss using inverse beta cumulative density 
function which are both data intensive and computation¬ 
ally intensive. Such an analysis will need to accept as 
input complete event loss distributions represented by 
the event rate, mean loss, independent standard devia¬ 
tion, and correlated standard deviation, and better ac¬ 
count for the range of possible outcomes. 

The second problem is related to implementing par¬ 
allel risk analysis methods for achieving timely results. 
Erom a computational perspective the ARA simulation 
differs from other Monte Carlo simulations since tri¬ 
als are pre-simulated, rather than randomly generated 
on-the-fiy. This provides millions of alternate views 
of a contractual year comprising thousands of events 
which are pre-simulated as a YET. Erom an analytical 
perspective, a pre-simulated YET lends itself to statis¬ 
tical validation and to tuning for seasonality and clus¬ 
ter effects. However, there are significant challenges in 
achieving efficient parallelisation. The extremely large 
YET must be carefully shared between processing cores 
if the computation is to achieve good speed-up when 
there is limited memory bandwidth. 

1.2.1. Addressing the problems 

In this paper, we investigate hardware acceleration 
platforms for ARA. Parallel algorithms for ARA that 
initially take primary uncertainty into account is im¬ 
plemented. Eurther, a methodology that considers sec¬ 
ondary uncertainty is presented. The algorithms are im¬ 
plemented on the Intel Phi Coprocessor and an NVIDIA 
Graphics Processing Unit (GPU). Experimental studies 
evaluate how ARA performs on the hardware accelera¬ 
tors independently and along with a host processor. 

Both the Phi Coprocessor and the GPU are compet¬ 
ing hardware accelerators that offer alternative machine 
architectures to that of a regular CPU. While both hard¬ 
ware accelerators are significantly different, they pro¬ 
vide in common, firstly, lots of cycles for independent 
parallelism, secondly, fast memory access under the 
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right circumstances, and finally, fast mathematical com¬ 
putations. The parallel algorithms implemented in the 
paper take full advantage of the high levels of paral¬ 
lelism, fast shared memory access and fast numerical 
performance. In this research, the algorithm exploits the 
machine architecture of the accelerators to achieve par¬ 
allelism and fast memory access and for performing fast 
numerical computations. The result is a speed up of 16x 
- 2lx that is achieved on the Phi and GPU over respec¬ 
tive baseline sequential implementations on the CPU. 
The experiments show the feasibility of hardware accel¬ 
erators offering a relatively low cost high-performance 
computing solution for performing fast ARA. 

1.3. Related Work 

The domain of computational finance and risk ad¬ 
dresses problems related to achieving fast computations, 
surmounting challenges of data management and effi¬ 
ciently handling memory of computing systems. Par¬ 
allelism is exploited to achieve this (for example, 1231 
EH ESI Ei). Therefore, this domain is dependent on 
the advances in high-performance computing. Research 
on financial applications for production-based comput¬ 
ing systems have progressed from small scale clusters 
lEZlEll to large supercomputers lEHEOl. These ap¬ 
plications are hosted either on in-house clusters or on 
supercomputers. 

A number of financial applications are migrated from 
small clusters to multi-core and many-core processors 
which are available at a low budget ED. For ex¬ 
ample, research related to financial applications ex¬ 
ploiting the Cell BE processor is reported in (SH [33]i . 
FPGAs are another alternative platform presented in 
lEHEIlESlEIl- GPU acceleration is employed more re¬ 
cently (3Sl|39l|40l[4Tl. Heterogeneous clusters compris¬ 
ing hardware accelerators are now employed 1421143]| . 
In all the above research, the need for speeding up finan¬ 
cial applications are presented and is achieved. There is 
limited research on how high-performance computing 
advances can be applied to simulations in the risk do¬ 
main, specifically the insurance and reinsurance setting, 
which is considered in this paper. 

Although high-performance computing platforms are 
an option to accommodate and accelerate risk simula¬ 
tions, there is an investment cost that will need to be 
borne along with maintenance costs. A relatively cost 
effective solution is to employ hardware accelerators to 
address the problems faced by current risk simulations. 
Hardware accelerators can provide fast numerical com¬ 
putations which are required by statistical functions that 
support applying secondary uncertainty in ARA. Addi¬ 


tionally, accelerators can be exploited to achieve speed 
up and thereby use risk simulations in real-time. 

The remainder of this paper is organised as follows. 
Section proposes an algorithm for performing ARA 
with primary uncertainty. Section [^presents a method¬ 
ology for applying secondary uncertainty and the inputs 
required for the extended analysis. Section presents 
the implementations of algorithms both for incorporat¬ 
ing primary and secondary uncertainties on multi-core 
and many-core hardware platforms. Section consid¬ 
ers the results obtained from the implementations in the 
experimental studies. Section [^concludes this paper. 

2. Aggregate Risk Analysis 

Stochastic Monte Carlo like simulations are required 
for portfolio risk management and contract pricing, and 
Aggregate Risk Analysis (ARA) is one such simulation. 
The merit of performing the analysis is that millions of 
alternative views of a single contractual year can be ob¬ 
tained. This section will consider the inputs required 
for performing ARA, propose an algorithm for ARA, 
present the financial terms employed in the algorithm, 
and highlight the output of the analysis (refer Figure [T]). 

2.1. Inputs 

The following three data tables are required for per¬ 
forming ARA: 

i. The Year Event Table (YET), denoted as YET, is a 
database of pre-simulated occurrences of events from 
a catalogue of stochastic events. Each record in a 
YET called a Trial’, denoted as T/, represents a pos¬ 
sible sequence of event occurrences for any given 
year. The sequence of events is defined by an or¬ 
dered set of tuples containing the ID of an event and 
the time-stamp of its occurrence in that trial repre¬ 
sented as Ti — {{El l, ti l ),..., {Eij^, tij^^}. 

The set is ordered by ascending time-stamp values. 
A typical YET may comprise thousands to millions 
of trials. Each trial may have approximately be¬ 
tween 800 to 1500 ‘event time-stamp’ pairs based 
on a global event catalogue covering multiple perils. 
The YET can be represented as 

YET = {Ti = {{Ei,i,ti,i),...,{Ei,k,ti,k)}l 
where i = 1,2,... and k = 1,2,..., 1500. 

ii. Event Loss Tables, denoted as ELT, represent col¬ 
lections of specific events and their corresponding 
losses with respect to an exposure set. Each record in 
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Figure 1: Inputs and outputs of Aggregate Risk Analysis 
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an ELT is denoted as ELi = [Ei, //} and the financial 
terms associated with the ELT are represented as a 
tuple I = (Ji, J 2 ,...). 

A typical aggregate analysis may comprise 10,000 
ELTs, each containing 10,000 to 30,000 event losses 
with exceptions of even up to 2,000,000 event losses. 
The ELTs can be represented as 

[EiJi), 

with i = 1,2,..., 30,000. 


ELT 


ELi 

I 


iii. A Portfolio, denoted as PE contains a group of 
Programs, denoted as P represented as PE = 
{Pi,P 2 ,--- ,Pn}, with^ = 1,2,..., 10. 

Each Program in turn covers a set of Layers, de¬ 
noted as L, which covers a collection of ELTs un¬ 
der a set of layer terms. A single layer is com¬ 
posed of two attributes. Eirstly, the set of ELTs 6 = 
{ELTi, ELT 2 ,..., ELTj}, and secondly, the Layer 
Terms, denoted as 7“ = {ToccR.ToccL.TAggR.TAggL)- 
The Layer occurrence terms are (i) Occurrence Re¬ 
tention, denoted as TqccR^ which is the retention or 
deductible of the insured for an individual occur¬ 
rence loss, and (ii) Occurrence Limit, denoted as 
Toccl, which is the limit or coverage the insurer will 
pay for occurrence losses in excess of retention. The 
Layer aggregate terms are (i) Aggregate Retention, 
denoted as TAggR, which is the retention or deductible 
of the insured for an annual cumulative loss, and (ii) 
Aggregate Limit, denoted as which is the limit 

or coverage the insurer will pay for annual cumula¬ 
tive losses in excess of aggregate retention. 

A typical layer covers approximately 3 to 30 individ¬ 
ual ELTs. The Layer can be represented as 

6 = {ELTuELT2,...,ELTj}, 1 

"E = {"EoccR^ToccL^TAggR^TAggL) ) 

with j = 1,2,..., 30. 


Algorithm 1: Aggregate Risk Analysis with Pri- 

mary Uncertainty 


Input : 

YET, ELT, PE 

Output: YET 


1 for each Program, P, in PE do 

2 

for each Layer, L, in P do 

3 


for each Trial, T, in YET do 

4 



for each Event, E, in T do 

5 




for each ELT covered by L do 

6 





Lookup E in the ELT and find 
corresponding loss, Ie 

7 





Ie <— Apply Linancial 

TermsC/^) 

8 





It ^— It Ie 

9 




end 

10 




It <— Apply Occurrence Linancial 
Terms {It) 

11 




It <— Apply Aggregate Linancial 
Terms (It) 

12 



end 


13 


end 



14 

end 




15 end 





16 Populate YET using It 


2.2. Algorithm 

Algorithm presents two stages for performing ag¬ 
gregate analysis with primary uncertainty. In the first 
stage, data is loaded into local memory, referred to as 
the pre-processing stage. In this stage the YET, ELT 
and PE, are loaded into memory. 

In the second stage, the four step simulation for each 
Layer and for each Trial in the YET is performed and 
the Year Loss Table {YET) is produced. 

In the first step shown in line no. 6, each event of a 
trial and the corresponding event loss in the set of ELTs 
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associated with the Layer are determined. In the second 
step shown in line no. 7, a set of contractual financial 
terms are applied to each loss value of the Event-Loss 
pair extracted from an ELT to the benefit of the layer. 
Eor this the losses for a specific event’s set of financial 
terms I are accumulated across all ELTs into a single 
event loss shown in line no. 8. 

In the third step (line no. 10), the event loss for 
each event occurrence in the trial, combined across all 
ELTs associated with the layer, is subject to occur¬ 
rence terms. Occurrence terms are applicable to in¬ 
dividual event occurrences independent of any other 
occurrences in the trial. The occurrence terms cap¬ 
ture specific contractual properties of ’eXcess of Loss’ 
omi treaties as they apply to individual event occur¬ 
rences only. The event losses net of occurrence terms 
are then accumulated into a single aggregate loss for 
the given trial. The occurrence terms are applied as 
It = min(max(lT - TqccR, 0 ), Toccl)^ 

In the fourth step (line no. 11), aggregate terms are 
applied to a trial’s aggregated loss of a layer. Unlike 
occurrence terms, aggregate terms are applied to the cu¬ 
mulative sum of occurrence losses within a trial, and 
thus, the result depends on the sequence of prior events 
in the trial. This behaviour captures contractual prop¬ 
erties as they apply to multiple event occurrences. The 
aggregate loss net of the aggregate terms is referred to 
as the trial loss or the year loss. The aggregate terms are 
applied as Ij = min(max(lT - T^AggR, 0), TAggi)- 

2.3. Output 

The output of ARA is a loss value associated with 
each trial of the YET. A reinsurer can derive impor¬ 
tant portfolio risk metrics such as the Probable Max¬ 
imum Loss (PML) and the Tail Value-at-Risk (TVaR) 
which are used for both internal risk management and 
reporting to regulators and rating agencies. Eurther- 
more, these metrics are used in Enterprise Risk Man¬ 
agement, where liability, asset, and other forms of risks 
are combined and correlated to generate an enterprise 
wide view of risk. 

Additional functions can be used to generate reports 
that will aid actuaries and decision makers. Eor exam¬ 
ple, reports presenting Return Period Losses (RPL), Re¬ 
gion/Peril losses, Multi-Marginal Analysis and Stochas¬ 
tic Exceedance Probability (STEP) Analysis. 

3. Applying Secondary Uncertainty 

The methodology to compute secondary uncertainty 
builds on industry wide practices. The inputs required 


for applying secondary uncertainty and the sequence of 
steps are considered in this section. 

3.1. Inputs 

Eor performing ARA accounting for secondary un¬ 
certainty requires the following additional inputs: 

Z{Prog,E) = P{Prog,E) ^ U(0,1), referred to 

as the Program-and-Event-Occurrence-Specific ran¬ 
dom number. Each Event occurrence across different 
Programs have different random numbers. 

ii. € f/(0,1), referred to as the Event- 
Occurrence-Specific random number. Each Event 
occurrence across different Programs have the same 
random number. 

iii. pi, referred to as the mean loss is the loss associated 
with an event Et. 

iv. (Ti, referred to as the independent standard deviation 
of loss, represents the variance within the event-loss 
distribution. 

V. (Tc, referred to as the correlated standard deviation 
of loss, represents the error of the event-occurrence 
dependencies. 

vi. maxi, referred to as the maximum expected loss. 

To capture the above additional inputs required for 
computing secondary uncertainty, the inputs used in 
ARA with primary uncertainty need to be modified. The 
Year Event Table (YET), denoted as YET is redefined as 

YET = [Ti = {{Ei^iJi^i,Z{Prog,E\x)^ • • •, 

U,k^ Z(Prog,E)ik)}}i 

where i = 1,2,... and k = 1,2,..., 1500. 

Extended Event Loss Tables, denoted as XELT, 
are used instead of ELT. Each record in an 
XELT is denoted as ‘extended’ Event Loss XELi = 
[Ei, li, Z(E)i, o-ii,crc-, maxi-} and the financial terms asso¬ 
ciated with the XELT are represented as a tuple I = 

The extended ELT is represented as XELT = 

XELi = [Ei , Pu ’ Z{E)i , so- 1 -, o-Q , maxi -}, 1 

I = (Ji,J2 ,...) y 

with i = 1,2,..., 30,000. 

The representation of the Layer is modified as L = 

8 = {XELTuXELT2,...,XELTj}, 1 

‘E = (ToccR,'l~OccL,'l~AggR,1~AggL) J’ 

with j = 1,2,..., 30. 
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3.2. Combining standard deviation 

Given the above inputs, the independent and corre¬ 
lated standard deviations need to be combined to reduce 
the error for estimating the loss value associated with an 
event. This is done in the following five steps: 

i. The raw standard deviation is produced as 

(T = (T/ -r (Tc. 

ii. The probabilities of occurrences, Z{Prog,E) and Z{e) are 
transformed from uniform distribution to normal dis- 

tribution using /(x;yu, cr^) = f ) dx. 

—oo 

This is applied to the probabilities of event occur¬ 
rences as 

^(Prog,E) ~ f(.Z(Prog,E)i 0,1) 6 A^(0,1) and 
V(E) = f (Z(E)\0, 1) G N(0, 1). 

iii. The linear combination of the transformed probabili¬ 
ties of event occurrences and the standard deviations 
is computed as 

LC = V(Prog,E){^) + 

iv. The normal random variable is computed as 


V. The normal random variable is transformed from 
normal distribution to uniform distribution as 

V _^2 

2 = 0(v) = pNormiv) = ^ f e~rdt. 

— OO 

The model used above for combining the indepen¬ 
dent and correlated standard deviations represents two 
extreme cases. The first case in which ctj = 0 and the 
second case in which crc = 0. The model also ensures 
that the final random number, z, is based on both the 
independent and correlated standard deviations. 

3.3. Estimating Losses 

The loss is estimated using the Beta distribution since 
fitting such a distribution allows the representation of 
risks quite accurately. The Beta distribution is a two pa¬ 
rameter distribution, with an upper bound for the stan¬ 
dard deviation. The standard deviation is o-r = and 

P maxi 

the mean is lir = -23 -. The alpha value is 

maxi ^ 


-l)’ 

and the beta value is 

An upper bound is set to limit the standard devia- 
tion using if then 

CTp = . In the algorithm reported in this paper, for 

numerical purposes, a value very close to is cho¬ 
sen. 

The loss after applying beta distribution functions are 
obtained as 

Loss = maxi * InvCDLjjeta{z\ Q',/3), 
InvCDFbeta{z\a,p) = 

where B{z\ 0^,13) = ^ - tf~^dt. 

0 

Algorithm requires the redefined inputs for ap¬ 
plying secondary uncertainty. The modified YET, the 
XELT and the modified Layer, L in PL considered in 
this section are used as inputs. The above steps to com¬ 
pute secondary uncertainty are incorporated after line 
no. 6 and before applying the financial terms to each 
event loss. 

4. Implementation 

The hardware platforms on which ARA is imple¬ 
mented are firstly considered in this section, followed 
by a discussion on the data structures used in the im¬ 
plementation of ARA and on the methods for comput¬ 
ing secondary uncertainty. Optimisations incorporated 
in the implementations are further considered. 

4.1. Experimental Platforms 

Four hardware platforms ranging from desktop 
CPUs, such as the Intel i7, a non-consumer worksta¬ 
tion/server, such as the Intel Xeon, and accelerators, 
such as the NVIDIA GPU and the Intel Phi, are used for 
implementing sequential and parallel ARA algorithms. 

Firstly, a multi-core CPU is employed whose speci¬ 
fications are a 3.40 GHz quad-core Intel(R) Core (TM) 
i7-2600 processor with 16.0 GB of RAM. The proces¬ 
sor has 256 KB L2 cache per core, SMB L3 cache and 
a maximum memory bandwidth of 21 GB/sec. 

Secondly, two multi-core CPUs are employed whose 
specifications are a 2.00 GHz octa-core Intel (R) 
Xeon(R) E5-2650 processor with 256 GB of RAM. The 
processor has a 20 MB cache and a maximum memory 
bandwidth of 51.2 GB/sec. 
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Both the above processors support Advanced Vec¬ 
tor Extensions (AVX) instructions for vector operations. 
Both sequential and parallel versions of the ARA algo¬ 
rithm were implemented on these platforms. The se¬ 
quential version adopted C-h-r, while the parallel ver¬ 
sion was implemented in C-r-h and OpenMR The ver¬ 
sions were compiled using the icpc compiler, version 
13.1 provided by the Intel(R) Compiler Suite. The 
‘-03’ compiler flag was used for optimisation, and the 
OpenMP directive and the Intel Math Kernel library 
were included. 

Thirdly, an NVIDIA Tesla C2075 GPU, consisting of 
448 CUDA cores, each with a frequency of 1.15 GHz, 
a global memory of 6 GB and a memory bandwidth of 
144 GB/sec. The peak double precision floating point 
performance is nearly 0.515 Tflops. The implementa¬ 
tion of the algorithm is compiled using the NVIDIA 
CUDA Compiler (nvcc), version 5.(Q The implementa¬ 
tion is compiled using ‘-arch sm_13’. 

Fourthly, an Intel Xeon Phi Coprocessors 51 lOP con¬ 
sisting of 60 cores with a frequency of 1.053 GHz. The 
coprocessor supports a maximum of 240 threads, and 
a memory of 8 GB and a memory bandwidth of 320 
GB/sec is available. The peak double precision float¬ 
ing point performance is close to one Tflop. The Phi 
Coprocessor is based on the Intel (R) Many Integrate 
Cores (MIC) architecture and supports the Intel Initial 
Many-Core Instructions (IMCI). 

The GPU is employed in two ways. Firstly, in a hy¬ 
brid mode, which refers to execution both on the host 
and the accelerator (operates in asynchronous mode and 
performs side-by-side processing with the host device). 
Secondly, in a non-hybrid mode, which refers to the host 
supporting the data pre-processing and transfer activi¬ 
ties and the accelerator performs all the computation. 
The Phi on the other hand is not only employed in the 
hybrid mode, but also in the native mode, which refers 
to the execution of the code on the accelerator indepen¬ 
dent of the host device. 

In this paper, sequential and parallel implementations 
of ARA on all the above four platforms, on the hybrid 
of i7 and GPU, and on the hybrid of Xeon and Phi are 
considered. Four statistical libraries, namely the Intel 
MKF, Boost, Betajic and CUDA Math are explored for 
applying secondary uncertainty. 

4.2. Implementing Data Structures for the Algorithm 

In ARA, the losses of events in a trial need to be de¬ 
termined by looking up losses in the XEFT. The key 
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design question is whether the data structure contain¬ 
ing the event-loss pairs of all trials need to be a sparse 
matrix in the form of a direct access table or a com¬ 
pact representation. While fast lookups can be ob¬ 
tained in the sparse matrix representation, this perfor¬ 
mance is achieved at the cost of high memory usage. 
Consider a YET with 1,000,000 events and one Eayer 
with 16 XEETs, and each XELT consisting of 20,000 
events with non-zero losses. The representation us¬ 
ing a direct access table would require memory to hold 
1,000,000 X16 = 16,000,000 event-loss pairs (without 
considering the data required for computing secondary 
uncertainty). While such a large data structure is held in 
memory, 15,700,000 events represent zero loss value. 

The sparse representation requires large amount of 
memory, but it is chosen over any compact representa¬ 
tion for the following reason. A search operation is re¬ 
quired to And an event-loss pair even in a compact rep¬ 
resentation. If sequential search is adopted, then 0{n) 
memory accesses are required to And an event-loss pair. 
If sorting is performed in a pre-processing phase to fa¬ 
cilitate a binary search, then 0{log{n)) memory accesses 
are required to And an event-loss pair. If a constant¬ 
time space-efficient hashing scheme, such as cuckoo 
hashing 1461 is adopted then an event-loss pair can be 
accessed with a constant number of memory accesses. 
However, this can be only be achieved at the expense of 
a complex implementation and overheads depreciating 
run-time performance. Further, such an implementation 
on hardware accelerators with complex memory hier¬ 
archies is cumbersome. Although large memory space 
is required for a direct access table, looking up event- 
loss pairs can be achieved with fewer memory accesses 
compared to the memory accesses in a compact repre¬ 
sentation. 

The GPU implementation of the algorithm uses 
global memory to store all data structures. The par¬ 
allel implementations on the GPU require high mem¬ 
ory transactions. This is surmounted by utilising shared 
memory wherever possible over global memory, but 
with challenges in dividing data according to memory 
access patterns of the threads. 

Two data structure implementations of 16 XELTs 
were considered for the GPU. In the first implementa¬ 
tion, each XEFT is considered as an independent table; 
therefore, in a read cycle, each thread independently 
looks up its events from the XELTs. All threads within 
a block access the same XELT. 

In the second implementation, the 16 XELTs are 
combined into a single table. In this case it is challeng¬ 
ing to divide data according to the access patterns. This 
results in threads within a block requiring to access dif- 
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ferent XELTs and loading many rows of the XELT into 
memory for each block. There are higher memory ac¬ 
cesses and transactions between the global and shared 
memories so that the threads can access the required 
events and losses from the XELTs. The implementa¬ 
tion with combined XELTs performed poorly compared 
to the implementation which uses independent XELTs. 
When only primary uncertainty was considered the ex¬ 
ecution time increased by nearly 15% and there is a fur¬ 
ther increase when secondary uncertainty is taken into 
account. 

4.3. Implementing Methods to Compute Secondary Un¬ 
certainty 

Three statistical functions are required in the method 
for applying secondary uncertainty. They are (i) the Cu¬ 
mulative Distribution Eunction (CDE) of Normal distri¬ 
bution, (ii) the Quantile of the CDE of Normal distribu¬ 
tion, and (iii) the Quantile of the Beta distribution. The 
Quantile of the Beta distribution is a numerically inten¬ 
sive function since an iterative method is required for 
converging at a solution within a certain error bound. 

Lour statistical libraries are used to implement the 
statistical functions required for applying secondary un¬ 
certainty. Eirstly, the Intel MKL library is used in the 
implementations on the CPU and Phi for the CDE of the 
Normal distribution and the Quantile of the Normal dis¬ 
tribution ||47]| . The Intel MKL API currently does not 
support Beta distribution functions. 

Secondly, the CUDA Math librar}0is employed. The 
CDE of the Normal distribution and the Quantile of the 
Normal Distribution are fast methods that are included 
in the implementation. The CUDA Math API currently 
does not support Beta distribution functions. 

Thirdly, the Boost statistical library is offered by the 
Boost C-\-\- librarie^ However, Boost is currently not 
supported for the GPU platform. 

Eourthly, BETA_NC another C-r-r library that can 
evaluate the CDE of the Noncentral Beta distribution 
is employed 14^ . This library is ported onto the GPU 
platform and all the functions in the libraries are imple¬ 
mented as __device__ functions for the GPU. This func¬ 
tion is also used for the Phi. 

4.4. Optimising the Implementation on the CPU 

The optimised implementation on the CPUs and the 
Phi coprocessor takes advantage of vectorisation by 


^ http://docs.nVidia.com/cuda/cuda-math-api/index. 
html 

^ http://WWW.boost.org/ 


making use of Intel’s AVX and Phi’s IMCI instructions. 
Single Instruction Multiple Data (SIMD) instructions 
facilitate eight (in the case of i7 or Xeon) and sixteen 
(on the Phi) single-precision floating point operations in 
the time they take for completing a single floating point 
operation. 

ARA incorporating primary uncertainty does not ben¬ 
efit from vectorisation as most computations are mem¬ 
ory intensive. However, in ARA incorporating sec¬ 
ondary uncertainty, a large number of floating point 
arithmetic is performed making it a better candidate to 
be vectorised. Eor vectorising the function which ap¬ 
plies secondary uncertainty, firstly, several single pre¬ 
cision floating point values are modified to single pre¬ 
cision arrays. Secondly, instead of applying secondary 
uncertainty on every Event loss (as in the regular imple¬ 
mentations), an array is populated with the Event loss 
values. Then secondary uncertainty is applied on the ar¬ 
ray. Such an implementation makes room for perform¬ 
ing a number of floating point operations at the same 
time. 

Eunctions using the Boost library are replaced with 
equivalents provided by the Intel MKL library as they 
are optimised to perform SIMD instructions on arrays. 
The basic floating point arithmetic in secondary un¬ 
certainty was vectorised with the help of the #pragma 
ivdep directive, which tells the compiler that it is safe 
to vectorise the arrays. 

Eurther, to exploit the benefits of vectorising, the in¬ 
verse CDE of the Beta distribution was implemented 
incorporating SIMD instructions. The Betamc library 
was found to lend itself well to vectorisation since it 
had very few branch conditions, such as break or early 
return statements. Branch conditions make the use of 
SIMD instructions more difficult. The inverse CDE of 
the Beta distribution was vectorised in a similar way to 
which the function applying secondary uncertainty was 
vectorised; using an array instead of individual floating 
point values. 

The function for computing the inverse CDE of the 
Beta distribution incorporates an algorithm for conver¬ 
gence of the solution. One condition for ensuring con¬ 
vergence is based on the desired accuracy of the result. 
In some cases a large number of iterations are required 
for the solution to converge causing the computations to 
take more time. To optimise the function further, solu¬ 
tions which converge quickly are switched out and re¬ 
placed with new values whose inverse CDE need to be 
found. In this way, the time spent on cases requiring 
large times for convergence can be mitigated by per¬ 
forming convergence in other cases. 
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4.5. Optimising the Implementation on Hardware ac¬ 
celerators 

Two approaches were considered for performing 
ARA with primary and secondary uncertainty on the In¬ 
tel Phi Coprocessor and the NVIDIA GPU. 

For the Intel Phi coprocessor, in the first approach, 
referred to as the native approach, the source code for 
the Xeon platform was built for the Intel Many Inte¬ 
grated Core (MIC) architecture using the Intel com¬ 
piler’s -mmic fiag. The executable along with dynamic 
libraries required and all input data were copied onto 
the coprocessor using the scp command. A connection 
to the Coprocessor was established using ssh and ARA 
was performed the same way as it was on the Xeon plat¬ 
form. Though this approach was easy to adopt, it was 
not easy to run a 800,000 trial simulation since the time 
taken for copying the input data onto the coprocessor 
from the host machine was 10 times more than it did for 
reading input data on the Xeon platform. 

For the Intel Phi coprocessor a second approach is im¬ 
plemented, referred to as the hybrid approach, in which 
the host (Xeon processor) is combined with the Phi Co¬ 
processor. In this implementation, the input data is ini¬ 
tially loaded into the host memory. The processor then 
writes a set of trials onto a buffer and sends the buffer 
to the device (Phi Coprocessor). The Coprocessor then 
applies secondary uncertainty and aggregates the results 
as needed in ARA for applying primary and secondary 
uncertainty. The device then writes back onto a buffer 
on the host. This implementation can accommodate 
800,000 trials unlike the first implementation since lim¬ 
ited amount of trials are only handled at once on the 
device. 

The connection from the host to the device is estab¬ 
lished using Intel’s Core Offload Infrastructure (COI) 
library. A process is created on the device using the 
library once the source starts executing, and sleeps un¬ 
til the host instructs the COI library to start the device 
function for applying secondary uncertainty and per¬ 
forming aggregation. 

The problem of the Coprocessor memory unable to 
accommodate the entire simulation in the native ap¬ 
proach is surmounted by iteratively sending a chunk 
of trials through a buffer. Better performance can be 
achieved in the hybrid approach by having the host and 
device perform the tasks asynchronously. For example, 
while the device is applying secondary uncertainty and 
aggregating a chunk of trials, the host could start chunk¬ 
ing the trials for the next buffer, or apply secondary un¬ 
certainty and perform aggregation for a different chunk 
of trials. 


For the GPU implementations, in the hybrid and non¬ 
hybrid mode, data from both shared and global memo¬ 
ries are migrated to the kernel registry, which has low 
latency. The numerical computations are made slightly 
faster by using the CUDA Math APIs support for float¬ 
ing point operations (the compiler fiag -use_fast jnath 
is included). 

5. Experimental Results 

In this section, the results obtained from the exper¬ 
imental studies are presented. Typical industry sim¬ 
ulation specifications are considered; from 200,000 to 
800,000 trials with each trial comprising 1,000 events, 
and one Layer covering 16 XELTs (industry practition¬ 
ers were consulted in choosing the size of the simu¬ 
lation). The combined size of the YET comprising 
800,000 trials, the 16 XELTs, and the metadata that de¬ 
fines one Layer is approximately 28 GB for each Layer. 
So if A Layers are to be used in ARA then the input 
data will be A x 28 GB. The size of the input data is re¬ 
duced when the ARA only considers PU since the YET 
and ELTs are not as large as when SU is considered. We 
refer to ‘ARA with PU’ when only primary uncertainty 
is considered and ‘ARA with SU’ when secondary un¬ 
certainty is also taken into account. A single library (for 
example. Boost) or a combination of libraries (for ex¬ 
ample, MKL and Beta_nc) is employed for the statistical 
functions in secondary uncertainty. 

Figure and Figure are the graphs obtained for 
the sequential implementation. Figure to Figure 
are the results obtained for parallel implementations on 
the CPUs, on hardware accelerators and on hybrid plat¬ 
forms. The summary of the results from both sequential 
and parallel implementations are presented in Figurep^ 

5.1. Sequential Implementation 

Figure!^ and Figureshow the results from the i7 
and Xeon CPUs respectively, when the Boost, MKL and 
Beta_nc libraries are employed to sequentially perform 
ARA. In the best case, the i7 is nearly 49% faster than 
Xeon for ARA with PU and 95% faster than Xeon for 
ARA with SU, resulting in the i7 being 1.32 times faster 
than Xeon. 

Surprisingly, the combination of MKL and the vec¬ 
torised Beta_nc libraries are outperformed by the com¬ 
bination using MKL and non-vectorised Betamc library. 
The vectorised code performs well when the precision 
of the inverse Beta CDF is high. In all the experiments 
presented in this paper, the solution for the inverse Beta 
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Figure 2: Sequential ARA with PU and SU for 200,000 
to 800,000 trials using the Boost, MKL and Beta_nc li¬ 
braries on CPU 


CDF converges when there is accuracy for up to six dec¬ 
imal places (or relative error is less than 10“^). How¬ 
ever, if the precision of the solution is increased for 
greater accuracy, for example, twelve decimal places, 
then the vectorised library improves the performance 
significantly. When the precision is low only few it¬ 
erations are required for the solution to converge; few 
iterations cannot reap the benefit of vectorisation, but 
introduces overheads which is a trade-off. 

All possible combinations of the libraries for the CDF 
and the Quantile of the CDF for the Normal distri¬ 
bution and the Quantile of the Beta distribution were 
employed. On both the i7 and Xeon, the combina¬ 
tion of the MKL (for the Normal distribution functions) 
and Betamc (for the Beta distribution function) libraries 
gave the best results. One advantage of using the MKL 
library is that it supports the Intel AVX instructions and 
performs well on Intel hardware. 


Figure 3: Time taken for sequential ARA with PU and 
SU using the Boost library on CPU 

Figurej^shows the individual times taken to apply PU 
and SU on the i7 and Xeon CPUs when Boost library is 
used. The time for applying SU is nearly 2.5 times the 
time taken for ARA in each case of trials shown in the 
graph. The mathematical functions employed for ap¬ 
plying SU are fast methods, although the inverse CDF 
of the beta distribution which takes majority of the time 
is an exception; over 95% of the time taken for SU is 
required by the inverse CDF. The time for applying sec¬ 
ondary uncertainty is nearly 1.5 - 1.7 times the time 
taken for ARA in each case of trials shown in the graph. 

The time taken for ARA with PU and SU with in¬ 
creasing number of trials should scale linearly. This is 
observed both in Figure and Figure 

5.2. Parallel Implementations 

The results obtained from parallel implementations 
on the multi-core CPUs are shown in Figure]^ to Figure 
The results when many core hardware accelerators 
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are used in the native mode or individually are shown 
in Figure to Figure Results obtained on hybrid 
platforms comprising multi-cores and many-cores are 
shown in Figure to Figure [T^ 


5.2.1. On CPU 

Figure shows the time taken for performing ARA 
with PU and SU when the number of threads are varied 
from 1 to 8 on the i7 (Figure and the Xeon (Fig¬ 
ure]^ platforms using the Boost, MKL and Beta_nc li¬ 
braries for 800,000 trials. A single thread is run on each 
core of the CPU and the number of cores are varied from 
1 to 8. Each threads performs ARA with PU and SU for 
a single trial. Multiple threads are used by employing 
OpenMP directive #pragma omp parallel in the C-I--I- 
source. With respect to the overall time on the i7 and 
Xeon CPUs, the MKL and Betamc combination per¬ 
forms the best requiring around 113 seconds. On both 
platforms, the performance of the standalone Boost li¬ 
brary is poorer than the combination of libraries. The 
vectorised library does not perform better than the un¬ 
vectorised library. This is so, since the benefits of vec- 
torisation are evident only when high precision is re¬ 
quired from the solution of the inverse Beta CDF. While 
there are only few iterations required for converging to 
six decimal place precision, as required in this research, 
the benefits of vectorisation are balanced off by the over¬ 
heads in the vectorised library. 

In the case of the best performer, the MKL and the 
Betamc library, a speed up of 4.6x is obtained on i7 and 
7.6x is obtained on Xeon. The performance on the Xeon 
is better due to the larger number of cores compared to 
the i7 and the bandwidth to memory on the Xeon which 
is 2.4 times that on the i7. The majority of the time 
taken to compute primary uncertainty is for performing 
random access reads into the data structure represent¬ 


ing the XELT (this is further discussed in Section 5.3 


on the i7 and Xeon, the time taken for such memory 
operations is nearly five times the time taken for the re¬ 
maining computations while applying financial terms in 
PU, also shown later in Figure [^. The majority of the 
time for applying secondary uncertainty in ARA is con¬ 
sumed in the inverse cumulative distribution function of 
the beta distribution (nearly 95% of the time taken for 
SU). 


Figure shows the graphs plotted for the time taken 
for performing ARA accounting both for PU and SU 
when the number of threads are varied from 16 to 2048 
on the i7 (Figurej^ and Xeon (Figurej^ platforms us¬ 
ing the Boost, MKL and Betamc libraries for 800,000 
trials. Multiple threads are run on each core of the CPU 
with the exception of the Xeon until 32 threads are used. 



—^ ARA with PU 


-A- ARA withSU- 
Boost 


ARA with SU - 
Boost + Beta_nc 


)< ARA with SU - 
MKL + Boost 
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MKL + Beta_nc 
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(a) i7 



(b) Xeon 


Figure 4: Time taken for ARA with PU and SU using 
one thread per computing core on multi-core CPUs for 
800,000 trials 
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For example, when 16 threads are employed on the 17 
and Xeon two threads and one thread run on each core 
respectively, and when 2048 threads are employed on 
the 17 and Xeon 256 threads and 128 threads run on each 
core respectively. In practice, the cores are not gener¬ 
ally over subscribed, but is presented in the experiments 
to show the effect of over subscription. For example, 
super linear speed ups are observed due to hyperthread¬ 
ing. On the i7 a small drop is observed first in the abso¬ 
lute time followed by a slight but gradual increase when 
many threads are executed on each core. However, on 
the Xeon, before there is a similar observation, there is 
a steep drop in the time moving from 16 to 32 threads. 
This is because the Xeon has 16 cores and the timings 
start to stagnate only after all the 16 cores are used. On 
the i7 the best result is 139 seconds compared to 48 sec¬ 
onds on the Xeon; the better performance by a factor 
of 3 is attributed to twice the number of cores and 2.6 
times the bandwidth to memory compared to the 17. 

Figure shows the graphs plotted for the time taken 
for ARA with PU and SU on the i7 and Xeon plat¬ 
forms for 32 threads (best performance is obtained for 
32 threads on both platforms). The time taken for ap¬ 
plying SU increases with the number of trials in each 
case of trials shown in the graph. The time taken for 

1 th 

applying SU on the i7 is only ^ the time taken in the 
sequential implementation on the 17. The time taken for 

1 th 

ARA with SU on the Xeon is only ^ the time taken in 
the sequential implementation. The Xeon is nearly three 
times faster than the i7 for applying PU and upto 2.6 
times faster for applying SU. Overall a speed up of ap¬ 
proximately 5x and 18x is obtained on the i7 and Xeon 
over their respective sequential implementations. 


5.2.2. On Hardware Accelerators only 

Figure and Figure show the graphs plotted for 
the time taken for performing ARA considering both PU 
and SU for 800,000 trials on the many-core GPU using 
the CUBA Math and Betamc libraries. The Boost li¬ 
brary is not available for GPUs. 

In Figure the analysis is performed for 800,000 
trials on the GPU by varying the number of threads per 
block from 16 to 512 threads. An improvement in the 
performance is seen as the number of threads per block 
increase from 16 to 128 since the latency for accessing 
the global memory drops. Beyond 128 threads, the per¬ 
formance starts to drop, since the shared and constant 
memory available to each thread decreases. The low¬ 
est time taken is 55 seconds when 128 threads per block 
are used, which is nearly 12 times faster than the best se¬ 
quential performance on i7 and 15 times faster than the 
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Figure 5: Time taken for ARA with PU and SU using 
multiple thread per computing core on multi-core CPUs 
for 800,000 trials 
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Figure 6: Time taken for ARA with PU and SU using 
MKL and Beta_nc libraries using 32 threads on multi¬ 
core CPUs 




(b) Using 128 threads per block for varying number of trials 


Figure 7: Time taken for ARA on the GPU 

best sequential performance on Xeon. The overall time 
for the analysis on the GPU is comparable to the per¬ 
formance of the multi-threaded implementation of 2048 
threads on the Xeon. 

Figure [7^ shows the time taken for performing ARA 
incorporating PU and SU using 128 threads per block 
on the GPU. Both times scale linearly as expected. The 
time taken for applying SU is over twice the time taken 
for performing ARA. 

Figure and Figure show the experimental results 
obtained when the Intel Phi alone is employed for per¬ 
forming ARA with PU and SU using Boost, MKL and 
Beta_nc libraries. Figure [8^ shows the performance of 
ARA when one thread is executed on each core of the 
Phi (the number of threads is varied from 1 to 60). As 
expected there is an increase in speed when the num¬ 
ber of threads is increased. The fastest time obtained is 
78.32 seconds on 60 threads using MKL and Betamc. 
The combination of Boost and Beta_nc also perform 
comparably. ARA with PU consists mostly of memory 
related operations and does not experience any signif¬ 
icant speedup. For example, 12 seconds are required 
on 30 threads as against 7 seconds on 60 threads. Over 
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(a) Single thread per core 



Figure 8: Time taken for ARA with PU and SU using 
Intel Phi for 800,000 trials 


90% of the time is required for applying secondary un¬ 
certainty when 60 threads are used. Vectorisation does 
not yield any potential benefit over non-vectorised func¬ 
tions in applying secondary uncertainty due to over¬ 
heads; vectorised function is nearly 38% slower than 
the fastest non-vectorised function on 60 threads. 

Figure shows the performance of ARA when PU 
and SU are applied using multiple threads per core 
on the Phi (number of threads is varied from 60 to 
240). Speedup is observed although there is limited ef¬ 
ficiency. Again best performance is noted when MKL 
and Betamc libraries are employed. The analysis is 
completed in 52.16 seconds and 42.1 seconds when two 
threads per core and three threads per core are used re¬ 
spectively. 

Figure shows that there is a linear increase in the 
time taken for applying PU and SU on the Phi processor 
for varying number of trials. The time taken for SU is 
over 8 times the time taken for PU; the for SU is just 
under 38 seconds both for the Phi and the GPU. The 



No. of trials 


■ Time for SU 


Time for PU 


Figure 9: Times taken for applying PU and SU in ARA 
on the Phi for varying number of trials 


real difference is observed for PU in that the Phi is four 
times faster than the GPU for performing memory com¬ 
putations. 


5.2.3. On Hybrid platforms 

Figure [T^ shows the time taken for performing ARA 
on the i7 CPU and the Tesla C2075 GPU used as a hy¬ 
brid platform. The workload of 800,000 trials is split be¬ 
tween the CPU and the GPU, with the intention that the 
CPU does not remain idle, in contrast to when the GPU 
alone is employed. The CPU makes use of the multi¬ 
cores of the processor and 16 threads using OpenMP to 
execute the workload; the MKL and Beta_nc libraries 
are employed for applying secondary uncertainty. The 
workload on the GPU makes use of the many cores and 
128 threads per core; the combination of the CUBA 
Math and Beta_nc libraries are made use of to apply 
secondary uncertainty. The x-axis of the graph shows 
the varying workload of trials on the CPU and the GPU 
and the corresponding time taken for the workload on 
each platform. The best case scenario is when the CPU 
and the GPU take similar times for execution, which is 
the point of inflection on the graph. The CPU tends to 
be idle before the plots converge and the GPU is not 
utilised beyond the point of convergence. The optimal 
workload is when the CPU performs the analysis for 
220,000 trials and the GPU performs the analysis on the 
remaining 580,000 trials. The analysis on the hybrid 
platform is 25% faster (41 seconds) than the GPU (55 
seconds). 


Figure 11 shows the experimental results obtained 
when the Intel Xeon and Phi Coprocessor are employed 
for performing ARA accounting for both PU and SU us¬ 


ing the Boost, MKL and Beta_nc libraries. Figure pTa 
is the graph showing the performance of ARA when the 
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Figure 10: Times taken on the i7 CPU and Tesla C2075 
GPU for applying PU and SU in ARA for varying trials 



number of threads are varied from 1 to 60 (i.e., execu¬ 
tion of single thread per core). As expected an increase 
in speed is noted with increasing number of threads. 
The fastest time obtained for ARA with SU is less than 
102 seconds on 60 threads using the MKL and Beta_nc 
libraries; the combination of the Boost and Beta_nc li¬ 
braries is also comparable. ARA accounting only for 
PU does not enjoy the same degree of speedup, since it 
consists mostly of memory operations; 18 seconds are 
required on 60 threads and 24 seconds on 30 threads. 
Nearly 82% percent more time is required for SU using 
60 threads. Despite the increased potential for vector- 
ization on the Phi coprocessor, the overhead in the vec¬ 
torized functions causes it to be nearly 20% slower than 
the non-vectorised library. 

Figure pTb| is the graph showing the performance of 
ARA incorporating PU and SU when the number of 
threads are varied from 60 to 240 (i.e., execution of 
multiple threads per core) on the Phi coprocessor. The 
speedup steadily decreases when multiple threads are 
executed on each core. For example, when two threads 
are executing per core the best time is 72 seconds com¬ 
pared to 62.88 seconds when three threads are executing 
per core. 

Figure shows the time taken for applying PU and 
SU on the Xeon and Phi. Both times scale linearly as 
expected. The time taken for applying SU is over twice 
the time taken for performing ARA with PU. The best 
library combination of MKL and Betamc takes approx¬ 
imately 3.5 times longer for SU than for applying pri¬ 
mary uncertainty. 

5.3. Discussion 

Figure is a graph that summarises the key results 
from the experimental study. The set of three bars rep¬ 
resents the time taken for (i) fetching Events from mem- 



Figure 11: Time taken for ARA with PU and SU using 
Intel Xeon and Intel Phi Coprocessor 



Figure 12: Times taken for applying PU and SU in ARA 
for varying number of trials on Xeon and Phi 


16 




































































ory and for look up of Loss Sets in memory, (ii) apply¬ 
ing Financial Terms and performing other computations 
in ARA, and (iii) applying secondary uncertainty on the 
sequential implementation on the CPU platforms and 
the parallel implementations on the multi-core CPU, 
many-core GPU and the Phi Co-processor. In each case, 
parameters specific to the implementation, such as the 
number of threads, were set to the best value identified 
during the experiments. 

In the parallel implementations for ARA only con¬ 
sidering PU, a speedup of 2.3x, of 20x and of 90x is 
achieved on the CPU, the GPU and the Phi respec¬ 
tively when compared to the corresponding sequential 
implementations (the parallel GPU results are compared 
against the sequential i7 results and the parallel Phi re¬ 
sults are compared against the sequential Xeon results). 
A speedup of nearly 12x is achieved in the overall time 
for the implementation on the GPU and a speed up of 
2lx is obtained for the overall time for the implemen¬ 
tation on the Phi in contrast to the respective sequential 
implementations on the CPU. For applying SU, multiple 
threading on the CPU is nearly between six to twenty 
two times faster than the sequential implementations. 
For the numeric computations on the GPU, an accel¬ 
eration of approximately 26x is achieved over the se¬ 
quential implementation. Limited memory bandwidth 
is a bottleneck in the CPU that results in approximately 
27% and 53% of the time spent for fetching Events and 
for look up of Loss Sets in memory for the sequential 
and parallel implementation on the CPU respectively. 
While the time for fetching Events and for look up of 
Loss Sets in memory have been significantly lowered 
on the GPU, 39% of the total time is still used to this 
end. In the case of the Phi, this time is lowered to 3 
seconds as against 16 seconds required by the GPU. 

On hybrid platforms, the combination of GPU and 
CPU outperform the Phi and CPU. The GPU and CPU 
perform ARA with PU and SU in only 41 seconds com¬ 
pared to 63 seconds required by Phi and CPU; the Phi 
and CPU is 50% slower than the GPU and CPU plat¬ 
form, which is nearly 16x faster than the baseline imple¬ 
mentation on i7. On individual accelerator platforms, 
without considering the time required for transferring 
data to the device from the host, the Phi only requires 
42 seconds against 55 seconds required by the GPU; the 
Phi is 1.3 times faster than the GPU for ARA and 2lx 
faster than the baseline implementation on Xeon. 

The Phi in the native mode seems faster than the com¬ 
bination of the Xeon and the Phi as a hybrid system. 
However, this is not always a practical solution since 
data needs to be copied offline to the memory of the Phi 
and is time consuming. In the case of ARA, copying 


data of 800,000 trials required more than three hours 
which is not included in the execution time. Although 
the execution takes only 42 seconds once all data resides 
in the Phi memory, it is less likely that most application 
can use the Phi in the native mode. The Phi is not in¬ 
tended to be used natively, but as a coprocessor. Hence, 
in the timing results the time taken for copying data to 
the Phi is not included. 

In sequential or parallel implementations on CPU, ac¬ 
celerator and hybrid platforms require nearly 50% or 
more time for applying SU in ARA. The majority of this 
time is required by the computations of the inverse CDF 
of the beta distribution. This calls for not only the devel¬ 
opment of fast methods to apply secondary uncertainty 
in risk analytics, but also the development of fast meth¬ 
ods for the underlying statistical functions. Fast meth¬ 
ods have been implemented for computing the inverse 
CDF of the symmetrical beta distribution l|49l which 
considers one shape parameter, but there are minimal 
implementations of fast assymetrical beta distribution 
that takes two shape parameters as required for the sec¬ 
ondary uncertainty methodology reported in this paper. 

6. Conclusions 

Hardware accelerators, such as GPUs and Phis, are 
attractive for accelerating applications that require HPC 
solutions, since they are low budget platforms. Risk an¬ 
alytics is one such domain that can hugely benefit from 
hardware accelerators. In this research, we set out to 
investigate which hardware accelerator is more benefi¬ 
cial for high-performance risk analytics. ARA, a Monte 
Carlo like simulation that considers primary and sec¬ 
ondary uncertainties for estimating portfolio wide risk is 
chosen as a candidate application. Parallel algorithms to 
speedup the risk simulation are implemented on multi¬ 
core CPUs, hardware accelerators and hybrid platforms 
comprising CPUs and accelerators. Experimental stud¬ 
ies are pursued on CPUs such as the Intel i7 and the 
Intel Xeon, and on accelerators, such as the NVIDIA 
GPU and the Intel Phi, for evaluating the performance 
of ARA. 

Both the Phi and the GPU are competing hardware 
accelerators and useful in different contexts for high- 
performance risk analytics. Overall, the experimental 
study indicates that for individual performance, when 
execution times are only considered without taking into 
account the data transfer times, the Phi outperforms the 
GPU by 23% achieving a speed up of 2lx over a base¬ 
line implementation; for a typical simulation of 800,000 
trials with each trial comprising 1,000 events and 16 ex¬ 
tended ELTs, the Phi individually requires only 42 sec- 
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Figure 13: Time taken for fetching Events from memory and for look up of Loss Sets in memory, applying Financial 
Terms and performing other computations in PU, and applying SU on sequential and parallel implementations of ARA 


onds against 55 seconds required by the GPU. However, 
the Phi as an independent processor is impractical in a 
large number of cases and is best suited as a coprocessor 
in a hybrid platform. The Xeon and Phi hybrid platform 
is 50% slower than the combination of i7 and the GPU. 
For the typical simulation on hybrid platforms, the i7 
and GPU combination takes only 41 seconds, which is 
16x faster than a baseline implementation, compared to 
63 seconds required by Xeon and Phi. 
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